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Explicit codes are often used to simulate the nonlinear dynamics of large-scale 
structural systems, even for low frequency response, because the storage and CPU re- 
quirements entailed by the repeated factorizations traditionally found in implicit codes 
rapidly overwhelm the available computing resources. With the advent of parallel pro- 
cessing, this trend is accelerating because explicit schemes are also easier to parallelize 
than implicit ones. However, the time step restriction imposed by the Courant stabil- 
ity condition on all explicit schemes cannot yet — and perhaps will never — be offset 
by the speed of parallel hardware. Therefore, it is essential to develop efficient and 
robust alternatives to direct methods that are also amenable to massively parallel pro- 
cessing because implicit codes using unconditionally stable time-integration algorithms 
are computationally more efficient than explicit codes when simulating low-frequency 
dynamics. Here we present a domain decomposition method for implicit schemes that 
requires significantly less storage than factorization algorithms, that is several times 
faster than other popular direct and iterative methods, that can be easily implemented 
on both shared and local memory parallel processors, and that is both computation- 
ally and communication-wise efficient. The proposed transient domain decomposition 
method is an extension of the method of Finite Element Tearing and Interconnecting 
(FETI) developed by Farhat and Roux for the solution of static problems. Serial and 
parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are 
reported and analyzed for realistic structural dynamics problems. These results estab- 
lish the superiority of the FETI method over both the serial/parallel conjugate gradient 
algorithm with diagonal scaling and the serial/parallel direct method, and contrast the 
computational power of the iPSC-860/128 parallel processor with that of the CRAY 
Y-MP/8 system. 
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[4], Belytschko, Plaskacz, Kennedy and Greenwell [5], and Biffle [6]) because 
these methodologies (a) axe easier to parallelize than implicit schemes and direct 
solvers, and (b) they usually induce short range interprocessor communications 
that are relatively inexpensive, while the factorization methods used in most im- 
plicit schemes induce long range interprocessor communications that often ruin 
the sought-after speed-up. However, the time step restriction imposed by the 
Courant stability condition on all explicit schemes cannot yet — and perhaps will 
never — be offset by the speed of parallel hardware, and many iterative solvers 
often fail when applied to systems arising from the analysis of large-scale flexible 
structures. Therefore, it is essential to develop efficient and robust alternatives to 
direct methods that are also amenable to massively parallel processing, because 
unconditionally stable implicit schemes are computationally more efficient than 
explicit time-integration methodologies at simulating low-frequency dynamics. 
The EBE/PCG algorithm developed by Hughes and his co-workers [1, 2] is such 
an alternative which additionally utilizes the structure inherent in finite element 
formulations and implementations. Here, we propose another alternative which is 
driven by substructuring concepts and which achieves a greater improvement over 
the CG algorithm with diagonal scaling than reported in [1] for the EBE/PCG 
scheme. However, unli ke EBE methods, the proposed methodology requires the 
assembly and factorization of substructure level matrices and therefore requires 
more storage than the EBE/PCG solution algorithm. 

A good balance between direct and iterative solution algorithms is provided 
by Domain Decomposition (DD) methods which usually blend both of these so- 
lution strategies. A well designed DD method requires less storage than direct 
solvers and converges faster than other purely iterative algorithms when applied to 
highly ill-conditioned systems (see, for example, the collection of papers compiled 
in [7]). Moreover, in their simplest form, DD methods have an explicit character 
because they effectively share information only between neighboring subdomains, 
which makes them very attractive for parallel processing. The method of Fi- 
nite Element Tearing and Interconnecting (FETI) is a DD algorithm based on a 
hybrid variational principle that was developed by Farhat and Roux [8] for the 
parallel solution of self-adjoint elliptic partial differential equations. It combines 
a direct solver for computing the incomplete subdomain displacement fields with 
a PCG algorithm for extracting the dual tractions at the subdomain interfaces. 
This method was shown to outperform optimized direct solvers on both serial 
and coarse-grained multiprocessors such as the CRAY Y-MP/8 system, and to 
compare favorably with other DD algorithms on a 32 processor iPSC-2 (Farhat 
and Roux [9]). In this paper, we present an extension of the FETI methodology 
to structural dynamics problems. In particular, we construct two subdomain-by- 
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stress and strain tensors, p s is the mass density, u is the displacement field, / is 
the forcing field, and A*’* is a Lagrange multiplier function which represents the 
interface tractions that maintain equilibrium between 12* and a neighboring Q q 
(interconnecting). The first of the inter- substructure continuity constraints (Eqs. 
(4)) can be dualized as: 


/ 6A*’ 9 (u* - u ? ) dT = 0 (5) 

Jn‘ p| 

If the original mesh of the global structure does not contain incompatible ele- 
ments, the subdomains {12*}*^f* are guaranteed to have compatible interfaces 
since these subdomains are obtained by partitioning the global mesh into N s 
submeshes. Therefore, we assume in the sequel that discrete Lagrange multipli- 
ers are introduced at the subdomain interfaces to enforce the inter-substructure 
displacement continuity. The piece-wise continuous approximation of the La- 
grange multipliers X t,q in Eq. (5) is treated in Farhat and Geradin [10] for static 
problems. Using a standard Galerkin procedure where the displacement field is 
approximated by suitable shape functions as: 

u* = Nu’ (6) 


and linearizing the equations of dynamic equilibrium around u n +i, Eqs. (3) and 
(5) are transformed into the following algebraic system: 


M*Aii* ( lt l) +K t ‘Auf +,) 


*n+l ~r iv uu n+l 

s= N, 

£ B ' Au » 

5=1 


= r 


n+ 1 


•p»* T \(*+U 
- U A n+1 


5 = 1 , 


N. 


<*+») 

+1 


= 0 


( 7 ) 

where the superscript T indicates a transpose, M 5 andJK t are respectively the 
subdomain mass and tangent stiffness matrices, Au* +1 and r* +1 are respec- 
tively the subdomain vector of nodal displacement increments and the subdo- 
main vector of out-of-balance nodal forces, B a is a boolean matrix that describes 
how Q, 3 is connected to the global interface, and A^*^ is the vector of Lagrange 
multiplier unknowns at iteration k + 1 and step n + 1. For linear elastostatic 
problems, Eqs. (7) above correspond to the discretization of a saddle-point vari- 
ational principle whose mathematical and computational properties are analyzed 
in Farhat and Roux [8, 9]. 
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displacement field increment Au^j^ and the momentum increment 
AvK 1 = M AulV,"- 

Here, we select a midpoint rule version of the time-integration algorithm 
(A3) and summarx ze it in order to keep this paper self-contained. Let Av^ +X 
denote the momentum increment at iteration k + 1 and at the midpoint between 
steps n and n + 1: 

Av (t+ , 1) = M Au 




n +i 


( 9 ) 


The midpoint subdomain increments Au* ( +X ) and Av^ +1 * are integrated as 
follows: 


(*+i) At (k+D 

Au ,i = — Au' i 

n +7 2 


,(*+1) A t (w) 
Av*„ +i = — Av! 


( 10 ) 




where At is the time step. Substituting Eqs. (10) into the dynamic equations of 
subdomain equilibrium (7) written at the midpoint step n + | leads to: 


Av 


.<*+») 

»+* 


2 

At 


- .... „<*+») 

= — M Au 


*+i 


( 11 ) 


and 


(M- + ^K--)A< ( ;r> = . = 1 N , 


At 


4 

E b ' a <T = 0 

3=1 


( 12 ) 


u 


,<*+*) 

n-f X 


.(*) , a a { * +1) 

u „ + l + Au „+X 


After Eqs. (12) are repeatedly solved for all nonlinear increments, u ^ + i found 
and the time derivative of each subdomain momentum could be obtained via 
back-substitution as: 




= r::\ - b* t a - f* iB, (u* llPc ,^) 


[ n+i 


(13) 


However, in order to bypass the dynamics of the Lagrange multipliers we implic- 

3 — A/j . . 

itly invoke the equilibrium condition B* A^ = 0 and directly compute 

3=1 

the momentum increment in assembled form as: 


'n + X = 


(14) 
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subdomain displacement fields [13]. Because of this compactness, the eigenval- 

i— /^* i *j' _ j 

ues of the matrix £ B ' K ‘ B * 0X6 weU separated and have a higher den- 
sity towards the low end of their spectrum. For a fixed At, we can show that 
a=N x 

has similar eigen properties. The objectives of this 

S-l 

section are: (a) to numerically illustrate the spectral properties of the transient 
interface operator Y2 ) 1 B J , and (b) to highlight the impact 

j=i 

of these spectral properties on the convergence rate of the CG algorithm. 


Consider the three-dimensional two-subdomain cantilever problem depicted 
in FIG. 1. The beam has a square cross section and a 10/1 length/height aspect 
ratio. Two finite element meshes are constructed using 8-node brick elements. 
The first mesh, Mi, contains 2400 internal degrees of freedom (d.o.f.) and 192 
interface d.o.f. The second mesh, M 2 , is finer: it contains 16320 internal d.o.f. 
and 672 interface d.o.f. 



FIG. 1 A three-dimensional cantilever problem 
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The x axis of the bax diagrams depicted in FIG. 2-3 represents the eigenvalues 
scaled by the smallest eigenvalue, and the y axis the number of eigenvalues per 

interval. For both meshes, ¥j is shown to have a few large eigenvalues that 
are well separated from the small ones. Figure 3 suggests that this separation 
amplifies when the mesh size h —+ 0. Indeed, one can mathematically prove that 
the large eigenvalues of Fj stabilize when the mesh size decreases, while its small 

eigenvalues accumulate towards zero. Essentially, this is because F j involves 
the inverses of the pencils (M J , K* ) and therefore its high modes correspond to 
physical modes while its low modes correspond to mesh modes. 

The spectral distribution of the interface problem associated with the FETI 
methodology has important consequences on the convergence rate of the CG 
algorithm. During the first iterations, the conjugate gradient algorithm mostly 
captures the eigenvectors associated with the large eigenvalues. Since the high 
numerical modes of Fj correspond to the low physical modes of the structure and 

since F j has only a few relatively high eigenvalues, the CG algorithm applied to 
the solution of Eq. (16) quickly gives a good approximation of the displacement 
increments. Intuitively, one can imagine that after the few relatively high modes 

of the interface operator are captured, the “effective” condition number of F ; 
— that is the ratio of its largest uncaptured eigenvalue over its smallest one — 

becomes significantly smaller than the original condition number of F/, which 
accelerates the convergence of the CG algorithm. A detailed analysis of this 
superconvergence behavior of the CG algorithm in the presence of well separated 
eigenvalues can be found in the work of van der Sluis and van der Vorst [14]. 

The impact of the spectral distribution of the transient interface operator F ^ 
on the convergence rate of the CG algorithm is highlighted in TABLE 1 which 
reports the number of iterations to achieve convergence for the FETI methodol- 
ogy described in this paper and for the classical Schur complement method. The 
transient interface operator resulting from the latter DD method (static conden 
sation) is denoted here by Sj. While both Fj and S/ are shown to have identical 

— t — t ~ t 

two-norm condition numbers (/c 2 (F / ) = k 2 (Sj)), the CG algorithm applied to F t 

is shown to converge twice as fast as when applied to Sj. This clearly demon 
strates that conditioning is not always the only factor governing the convergence 
rate of the CG algorithm. 
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FIG. 4 A four-subdomain bending plate problem 


TABLE 2 

Four-subdomain bending plate problem 


Discretization: N x N where N = ^ 
Algorithm: CG-FETI 


Convergence criterion: 


IlK (<*> ) AU ( n V, 1 ) -r(ii< i *>,)|| a 


l|r(uSx)IU 


< 10 


,-4 


h # of d.o.f. # of interface d.o.f. # of iterations 


^ 605 55 

^ 2205 105 

i 8405 205 

^ 32805 405 

129605 805 


43 

54 

74 

75 
75 
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~ 1 _ 1 ~t 

FIG. 5 Spectral density o/Lj F j for mesh M 2 


By comparing FIG. 3 and FIG. 5, the reader can observe that L / essentially 
amplifies the separation between the clusters of small and large eigenvalues of 
the interface problem, which accelerates the convergence of the CG algorithm. 
Following the reasoning outlined in Section 3, the reader can also conclude from 
FIG. 3 and FIG. 5 that after the few large eigenvalues of the interface problem 
axe captured, the “effective” condition number of Lj F ; is much smaller than 
that of F/. This “superconvergence” behavior is highlighted in TABLE 3 below 
for the three-dimensional two-subdomain cantilever problem. 
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matrix-vector products of sizes equal to the subdomain interfaces. FYom a me- 
chanical viewpoint, solving Eq. (19) corresponds to finding a set of “lumped 
interface forces that can reproduce the imposed jump of the displacement incre- 
ments. Therefore, the problem described by Eq. (20) is indeed an approximate 
inverse of the interface problem. 


5. Preconditionig with a primal operator 
5.1 Sum of the exact inverses 

For the sake of clarity and notation simplicity, we first assume that the 
mesh partition does not contain cross-points — that is, points where more than 
two subdomains intersect. After the mecha n ical interpretation of the primal 
preconditioner presented below is given in Section 5.2, the treatment of cross- 
points becomes straightforward. 

A better approximation of F 7 can be obtained by approximating the inverse 
of the s um by the sum of the exact inverses — that is, by assuming that: 


[ Y s a (M a + ^-K < *)“ 1 B * T ]“ 1 




£[B # (M' + ^K t ')- 1 B ,T ]- 1 (21) 


3 — 1 


which leads to the following preconditioner: 


D 



3 ~ N ‘ Ai 2 T 

^[B'(M* + =-K‘‘)- 1 B‘ p 1 


( 22 ) 


If the subdomain mass and stiffness matrices and the subdomain displacement 
field axe partitioned as: 


M 9 = 


[M*. 

T 


, K s = 

[K*. 


, and u* = 

uf 


1 

Wi .0 

2 

7 

[KU 

KU. 




where the subscripts j and 5 respectively refer to the internal and interface bound- 
ary degrees of freedom, the inverse of (M^ + ^-K* ) can be written as the solution 
of the partitioned matrix equations: 


M-, + ^k;, M'j + ^Kf 


LM*. +^-K* t 


ib 


Mt„ + ^-K * 6 


’ 

r a 3 

A t» 

Ai' 


I 

O 

. 


Ai. 


O 

I 


( 23 ) 
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Here, p b denotes the search direction computed during the 9 -th iteration, and 
r q b denotes the $-th residual which represents the jump of the displacement in- 
crements across the subdomain interfaces. All computations summarized in Eqs. 
(28) can be performed at the subdomain level as follows: 




Pb = 


w* — 


A / 2 

B’(M* H — — K‘~) 
4 

A / 2 

Km;, + — k;,) 




(29) 


- (M-, + ^K-,f (M- f + 


The first of Eqs. (29) can be re-formulated as a problem with Neumann boundary 
conditions (indicated between braces {}): 



The solution of this problem represents the trace on the interface boundary 
of fl* of the displacement field resulting from prescribing the interface traction 
forces . 

The second of Eqs. (29) can be re-formulated as a problem with Dirichlet 
boundary conditions (indicated between braces {}): 


M-, + 

T 

Lm ' 6 + 






(31) 


and which can be solved in two steps: 

At 2 


Solve (M* t + 


At 2 


4 K*K’ = - (M-, + 4 
At 


K‘,)f 

At* 


Evaluate zf = (M * 6 -| — — K * 6 ) z i + (Mj t -1 — K 66 )r£ 


(32) 


Therefore, the preconditioning step reformulated in Eq. (31) corresponds to find- 
ing in each subdomain the traction forces that are needed to prescribe the 
interface boundary displacement increment jump r£ . 

The mechanical interpretation of the CG algorithm with the primal precon- 
ditioner Dj is straightforward. Within each iteration, a Neumann problem is first 
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a polynomial function of the ratio between consecutive eigenvalues. 

The loss of orthogonality of the search directions during the solution of the 
interface problem (16) has a disastrous consequence on the convergence rate of 
the PCG algorithm. To remedy this problem, we introduce a reorthogonalization 
procedure within the PCG algorithm which however, in order to determine the 
new direction vector, entails at each iteration q the following additional burden: 

~t 

(a) the storage of the direction vector p ? and the product Fjp ? . 

(b) the evaluation of q dot products of the form rJ'[F/p 9 ] where [F /P 9 ] is readily 
available and 1 < j < q, and of an nj x j matrix- vector product where nj is 
the number of interface unknowns. 

Clearly, such a reorthogonalization procedure is not feasible if introduced 
during the solution via the PCG algorithm of a global finite element problem, as 
it would require unreasonable amounts of memory and CPU. However, it is quite 
affordable within the context of a domain decomposition algorithm as it applies 
only to the interface problem. In particular, the reader should note that the 
additional computational costs outlined in (b) are small compared to the cost of 
the pair of forward and backward substitutions that are required at each iteration 

q of the PCG algorithm in order to evaluate the product F jp q . 

The efficiency of the reorthogonalization procedure discussed above is high- 
lighted in TABLE 4 for the four-subdomain bending plate problem introduced in 
Section 3. 


TABLE 4 


Four-subdomain bending plate problem 


Algorithm: PCG-FETI 
Preconditioner: L j 

IlK («<*>,) Aul‘+ 1) -r(« ( n],)lb 
Convergence criterion: ||r(u^ )jj” 

Multiprocessor: iPSC-860 (4 processors) 


< ur 3 


h 

# of iterations 
(without reorth.) 

# of iterations 
(with reorth.) 

CPU 

(without reorth.) 

CPU 

(with reorth.) 

JL 

42 

22 

5.5 s. 

3.7 s. 

20 

1 

40 

66 

24 

33.2 s. 

16.5 s. 
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Prom (16) and (35), it is clear that the gradient of the interface problem (16) 
is indeed the jump in the displacement increments at the subdomain interfaces. 
Therefore, the norm of the residual of the interface problem || — GA^* +1 ^ + H|| 
gives the order of magnitude of the error in Au n+ i and not the order of magnitude 

of the global residual K*(u^ x ) Au^ 1 * — where K (u^ x ) = M + 

ALLk*. Moreover, since the condition number of K varies as 0(l/h 2 ) while the 
condition number of F j varies as 0(1 /h), the convergence criterion: 

|| — GA (fc+1) * -h H|| < e ||-GA ( * +1) ° +H|| (36) 


does not guarantee that: 


l|K (u'A) Au<‘ + + ‘>' - r(u< n *> 1 )|| < < IIK’tuS,) Au^«»° - r(u < n *> 1 )|| (37) 


We have observed that for most structural problems, the norm of the relative 
global residual (37) is typically 10 2 to 10 3 larger than the norm of the relative 
interface residual (36), which clearly indicates that the convergence of the FETI 
methodology should be based on the global criterion (37). 

Unfortunately, evaluating (37) at every PCG iteration q requires perform- 
ing a global matrix-vector multiply in order to compute the global residual 
K*(u ( n *^) Au^ 1 ^ — r(u^ 1 ). This would double both the computational 
and communication costs of a PCG iteration. Therefore, we had to develop an 
estimator for the global residual that is accurate and computationally economical. 

Using Eq. (12) and the partitioning introduced in Section 5.1, the restriction 
of the global residual to every subdomain Q, 9 can be written as: 


K « +1 )Au n+| -r(u 


s(t) j - 

n+17 — 


,<*+») 9 


(m;» + (Au n+1 


2 bb 
.(*+»)* 


A ,(* + 1)9 s 

Au , ) 

n +i bb' 


.(* + 1)9 


(M! l + fKi)(i<r; 

(38) 

where Au* ( ^y )? — Au® ( ^V >? is the jump in the displacement increments at the 


bb 


n+ 


1 bb 


subdomain boundary interface. Clearly, the reaction forces ( M K 3 tb ) (Au*^ 1 — 


Au^:r )( ) are zero except on those degrees of freedom that are connected to the 


«+-? 


bb 


interface ones. Moreover, equilibrium suggests that ||(M* fc -f ^j-K* t ) (Au 1 ' 


.*(* + !> 


bb 


A S ( * + 1)9 

Au 1 ,, 

”+2 bb' 


)|| and \\(M 3 bb +^fK a bb ) (Au l( ‘ + 1 1)< )|| are of the same order 

* 2 bb 2 DO 
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The discrepancy between the interface and global residuals and the accuracy 
of the proposed global residual estimator are demonstrated in FIG. 6 which re- 
ports the logarithm of the Euclidean norm of the various residuals for the four 
subdomain bending plate problem with h — 1/40 (Section 3). 



7.5. Setting the tolerance for the PCG algorithm 

Let d n+1 , u* +1 , and (u* +1 )°° denote respectively the exact solution of the 
nonlinear problem (1) at time step n + 1, the exact solution of the linearized 
problem at the k - th nonlinear iteration of time step n + 1, and the PCG solution 
of the linearized problem at the k-ih nonlinear iteration of time step n 4- 1. Our 
stopping criterion for the PCG algorithm is: 


IIK'tuS,) a- 1 


'»+4 


r(u!3.)ll 


< £ I|k'( u !>+i) Au ’ 




/ (*) • 
- r ( U >H | . 


( 44 ) 


where q denotes the PCG iteration number. It follows that: 


ll(* 


Jt+XN°° 
+ 1 / 


u 


Jt+1 


■n+lll - 


< € 


U 


n+1 


-d 


n + 1 | 


( 45 ) 
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FIG, 7 Finite element discretization of a space antenna connector 

(from a finer mesh) 


8.2 Reference serial performance results 

First, performance results are reported for the direct method [LDL T factor- 
ization) and the CG algorithm with diagonal scaling (Jacobi-PCG) applied to the 
un-decomposed problem (TABLE 5). These results will later serve as reference 
serial performance results. Memory consumption is measured in millions of 64 
bit words (MW). MFLOPS (Million FLoating-point OPerations per Second) are 
reported in order to distinguish and independently assess the numerical and im- 
plementational performances. A sparse data structure is used for the Jacobi-PCG 
algorithm. 
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FIG. 8 Growth of the interface problem 


TABLE 6 

Space antenna connector - FETI(Lj ) serial performance results 
67704 equations - At = = 0.0123s. 

Preconditioner: Lj 

„ Au ^+ 1 )- r ( u <„ t > l )|| 2 ^ 4 

Convergence criterion: , lw „|tj ... ^ iU 

ll r ( U „+i)IU 

CRAY Y-MP (single processor) 


N, 

memory 

CPU 

FAC 

MFLOPS 

FAC 

# of itr. 

CPU /itr. 
PCG 

CPU 

PCG 

MFLOPS 

PCG 

CPU 

TOT 

4 

18 MW 

19.0 s. 

165 

159 

0.43 s. 

68.4 s. 

130 

87.4 s. 

8 

16 MW 

13.6 s. 

150 

200 

0.39 s. 

78.0 s. 

110 

91.6 s. 

16 

12 MW 

9.6 s. 

130 

267 

0.36 s. 

96.1 s. 

105 

105.7 s. 

32 

10 MW 

6.7 s. 

105 

456 

0.34 s. 

155.0 s. 

90 

161.7 s. 

64 

9 MW 

5.0 s. 

85 

717 

0.33 s. 

236.6 s. 

75 

241.6 s. 
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Jacobi-PCG global algorithms. In the best case (4 subdomains), the FETI 
method with the lumping preconditioner is three times faster than the global 
direct solver and four times faster than the global Jacobi-PCG algorithm. 

• The primal preconditioner reduces the number of iterations performed by the 
l um ping preconditioner by a factor of 1.4 in the [4-16] subdomains range, and 
by a factor of 1.2 in the [32-64] subdomains range. However, a CG iteration 
with the primal preconditioner is 1.8 times more expensive than a CG itera- 
tion with the lumping preconditioner, so that the lumping preconditioner is 
overall more efficient. 

• The performance of both FETI algorithms deteriorates when the number 
of subdomain is increased (FIG. 9). We refer to this phenomenon as the 
numerical H non-scalability of the FETI methodology, Clearly, the super- 
convergence behavior discussed in Section 3 seems to disappear in the case 
of fine mesh partitions and the number of iterations seems to grow linearly 
with the interface size (FIG. 8). However, the reader should note that for 
the above structural dynamics problem, increasing the number of subdo- 
mains from 4 to 64 increases the number of iterations by a factor of 4.5, but 
increases the CPU time by a factor of 2.7 only (TABLE 6). This is because 
the cost of a PCG iteration decreases with the size of a subdomain. More* 
over, since the vector speed drops from 130 MFLOPS in the 4 subdomain 
case to 75 MFLOPS in the 64 subdomain case, increasing the number of 
subdomains from 4 to 64 actually increases the operation count of the FETI 
method by a factor of 2.7 x 75/130 = 1.6 only. 
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a tree communication algorithm [25] is used for broadcasting at each factorization 
step the pivotal column to all of the processors, and a node clustered wrap map- 
ping algorithm rather than a single degree of freedom wrap mapping algorithm 
is used. The distributed data structure described in [24] is augmented with a 
pointer array that identifies the last active column of each structural equation. 
On the CRAY Y-MP /8 system, the forward and backward substitutions are se- 
rialized. On the iPSC-860/128 multiprocessor, only the backward substitution 
is serialized. These serializations are performed because of well-known mapping, 
synchronization, and communication bottlenecks in the parallel solution of skyline 
triangular systems [23, 24]. 

The performances of the above parallel solvers are assessed with the nonlin- 
ear transient analysis of two structural systems: (a) the space antenna connector 
described in Section 8.1, and (b) a stiffened wing panel from the V22 tiltrotor 
aircraft (based on the panel described in Davis, Krishnamurthy, Stroud and Mc- 
Cleary [26]) (FIG. 10). The finite element model depicted in FIG. 10 con Mins 
9486 nodes, 9136 4-node shell elements with 6 d.o.f. per node, and a toi.d of 
54216 d.o.f. 



FIG. 10 Finite element discretization of a stiffened wing panel 
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TABLE 9 

Space antenna connector - performance results on the CRAY Y-MP/8 system 


67704 equations - At = y| 


0.0123s. 


~r 


Convergence criterion: 


FETI preconditioner: Lj 

UK‘( nl*2,) AulV; 1) -r(u ( n t i,)|| 3 


*n 4- 1 ^ ~ ” t 1 

ll r ( u i+ 1 )ll> 


< 10 


-4 


N„ 

n 3 

ff of itr. 
FETI 

# of itr. 
Jacobi-PCG 

CPU 

FETI 

CPU 

Jacobi-PCG 

CPU 

direct 

1 

1 

— 


— 

345.6 s. 

253.0 s. 

4 

4 

159 


24.3 s. 

93.9 s. 

70.3 s. 

8 

8 


3320 

13.7 s. 

50.2 s. 

39.5 s. 


TABLE 10 

Stiffened wing panel - performance results on the CRAY Y-MP/8 system 


54216 equations - At = = 0.00179s. 

~* -1 

FETI preconditioner: L j 


Convergence criterion: 


< 10 -4 


N p 

N a 

of itr. 
FETI 

of itr. 

Jacobi-PCG 

CPU 

FETI 

CPU 

Jacobi-PCG 

CPU 

direct 

1 

1 

— 

4775 

— 

418.6 s. 

259.7 s. 

4 

4 

300 

4775 

36.5 s. 

114.0 s. 

72.1 s. 

8 

8 

396 

4775 

20.9 s. 

60.8 s. 

40.6 s. 
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All algorit hms are reported to achieve good speed-ups. This is essentially because 
the CRAY Y-MP /8 system has a shared memory and a relatively small number 
of processors. For the antenna connector problem, the FETI method is shown to 
outperform the direct solver by a factor of three and the Jacobi-PCG algorithm 
by a factor of four. The wing panel structure apparently generates a stiffer 
problem: the FETI method converges with a larger number of iterations than for 
the connector problem, but still outperforms the direct solver by a factor of two 
and the Jacobi-PCG algorithm by a factor of three. 

9.2. Performance results on the iPSC-860/128 multi-processor 

The performance results measured on an iPSC-860/128 Touchstone Gamma mul- 
tiprocessor are summarized in TABLE 1 1 for the antenna connector problem and 
in TABLE 12 for the wing panel problem. A minimum of 32 and 64 processors are 
needed to accommodate the memory requirements of the FETI method and the 
direct solver, respectively. Our current implementation of the FETI algorithm on 
on this parallel processor allows only one processor per subdomain. Therefore, 
it is important to note that the FETI methodology is benchmarked here in the 
worst conditions, given that its intrinsic performance deteriorates in the presence 
of large numbers of subdomains (Section 8.3). 


TABLE 11 

Space antenna connector - performance results on the iPSC-860/128 system 


67704 equations - A t = = 0.0123s. 

-f 1 

FETI preconditioner: L/ 




.. . iik‘< 

1|(*) \ An( l +')_ )|U 
U »4-J AU n + l ‘'“-.ti' 111 ^ in 

— 4 


convergence criterion; 


)iii 


N p 

N. # of itr. 
FETI 

# of itr. 
Jacobi-PCG 

CPU 

FETI 

CPU 

Jacobi-PCG 

CPU 

direct 

32 

64 

32 456 

64 717 

3320 

3320 

144.0 s. 

144.1 s. 

229.4 s. 
144.8 s. 

462.6 s. 
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TABLE 13 

Space antenna connector - compute v.s. send v.s. receive on the iPSC-860/128 system 


algorithm 

N p 

compute 

send 

receive 

dot product 

total CPU time 

FETI (L; ) 

64 

91.1 s. 

1.0 s. 

9.0 s. 

43.0 s. 

144.1 s. 

direct (factor) 

64 

54.4 s. 

25.4 s. 

312.0 s. 

— 

728.1 s. 

direct (forward) 

64 

1.4 s. 

0.1 s. 

14.3 s. 

— 

21.5 s. 

direct (backward) 

64 

0.4 s. 

1.3 s. 

53.3 s. 

— 

193.7 s. 

direct (total) 

64 

56.2 s. 

26.8 s. 

379.6 s. 

— 

462.6 s. 


Clearly, the elapsed CPU time reported for the parallel direct solver is essentially 
spent in synchronization and interprocessor communication. On the other hand, 
the FETI method is shown to be communication efficient. The dot products 
performed during the solution of the interface problem are timed separately be- 
cause they are implemented via calls to the iNTEL system routine gdsum. We 
could not evaluate the repartition of these timings between compute, send and 
receive. If we can assume that they are equally distributed between computation 
and communication, then we can conclude that only 21.8% of the total CPU 
time reported for the FETI method is spent in synchronization interprocessor 
communi cation . 

The performance results summarized in TABLES 9-13 also demonstrate that 
the iPSC-860/128 Touchstone Gamma multiprocessor cannot compete with the 
CRAY Y-MP/8 system as far as direct solvers axe concerned. However, for 
explicit-like computations such as those characterizing the FETI and Jacobi-PCG 
algorithms, 32 processors of the iPSC-860/128 system are shown to outperform 
one CRAY Y-MP processor. 

10. Circumventing the numerical H non-scalability 

The H non-scalability property characterizing the FETI methodology and 
discussed in Section 8 requires keeping the number of subdomains relatively small 
in order to obtain the best possible performance. However, increasing the num- 
ber of subdomains while keeping the mesh size constant seems a priori attractive 
because: (a) it reduces the cost of the local problems, and (b) it also reduces the 
cost per iteration of the interface problem. Nevertheless, the results reported in 
TABLE 6 clearly indicate that refining beyond a certain number of subdomains 
the mesh partition does not significantly reduce any further the computational 
costs of neither the subdomain nor the interface problems. This can be explained 
by the fact that when the number of subdomains is increased, the subdomain 
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11. Closure 


A memory efficient domain decomposition (DD) method for implicit tran- 
sient dynamics has been presented. It is based on the concept of Finite Element 
Tearing and Interconnecting (FETI) developed by Farhat and Roux [8]. Two 
subdomain-by-subdomain preconditioners with direct mechanical interpretations 
have been formulated. The first preconditioner, called here a lumping precondi- 
tioner, is constructed as the sum of the projections on the subdomain interfaces 
of the inverses of the subdomain transient operators. The second preconditioner, 
called here a primal preconditioner, is constructed as the sum of the exact inverses 
of the subdomain transient operators. When preconditioned with the lumping op- 
erator, the interface problem behaves as if its condition number is independent 
of the mesh size. When preconditioned with the primed preconditioner, the inter- 
face problem has a condition number that is weakly dependent on the mesh size. 
Therefore, from a mathematical viewpoint, the primal preconditioner is optimal. 
However, the interface preconditioner is cheaper .and computationally more effi- 
cient. For coarse mesh partitions, we have shown that the proposed FETI method 
can outperform on the CRAY Y-MP/8 multiprocessor the parallel direct solver 
and the parallel conjugate gradient algorithm with diagonal scaling by factors of 
3 and 4, respectively. For fine mesh partitions, say more than 32 subdomains, 
the performance of the FETI algorithm degrades — and this is the case for most 
DD methods. Therefore, when working with a massively parallel processor, we 
recommend to keep the number of subdomains as small as possible and allocate 
more than one processor per subdomain. However, even in the worst case of one 
processor per subdomain, the FETI method still outperforms the parallel direct 
solver on the iPSC-860/128 system because of its low communication costs. Fi- 
nally, the results we have reported for two realistic structural dynamics problems 
indicate that 32 processors of an iPSC-860 Gamma system can outperform a 
CRAY Y-MP processor. 
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